ARRAY PROCESSOR



BACKGROUND 



Developed by FSD for Navy 
Part of Proteus Sonar System 
Integer Machine for FFT 
Interest by DP for 2938 Follow-On 
PASC Application Studies 
Elser Task Force 
ARRAY PROCESSOR



HARDWARE 



1 Megabyte Bulk Store (24K points, 1M-purchase) 
24/48 bit Fraction 

Arithmetic Element

2 Adders 

1 Multiplier 
Highly Pipelined 
100 ns Cycle 

2-1000 Word Working Stores 
Microprogrammed 
Short/Long Precision 
No Error Checking 

Control Processor 

2 Microsecond Cycle
Controls Data Transfers 

Host to Bulk Store (3 MB/s)
Bulk Store to Working Store (40 MB/s)
Provides Overlap 

370 Channel Interface 
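
The two transfer rates quoted for the Control Processor can be turned into rough transfer times. The sketch below is only illustrative: it assumes the slide's rates mean 3 and 40 million bytes per second and takes the 1 Megabyte Bulk Store as 10**6 bytes; the function and variable names are hypothetical.

```python
# Rough transfer-time estimate from the rates quoted on the HARDWARE slide.
# Assumptions (not stated in the source): the rates are 3 and 40 million
# bytes per second, and 1 Megabyte is taken as 10**6 bytes.

HOST_TO_BULK_BPS = 3_000_000      # Host to Bulk Store
BULK_TO_WORKING_BPS = 40_000_000  # Bulk Store to Working Store


def transfer_ms(n_bytes: int, rate_bps: int) -> float:
    """Milliseconds needed to move n_bytes at rate_bps bytes per second."""
    return 1000.0 * n_bytes / rate_bps


bulk_store_bytes = 1_000_000
host_ms = transfer_ms(bulk_store_bytes, HOST_TO_BULK_BPS)
working_ms = transfer_ms(bulk_store_bytes, BULK_TO_WORKING_BPS)

print(f"host -> bulk store:    {host_ms:.1f} ms")
print(f"bulk -> working store: {working_ms:.1f} ms")
```

Under these assumptions, filling the whole Bulk Store from the host takes roughly thirteen times longer than draining it into the Working Stores, which is one way to read the "Provides Overlap" item.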
IBM CONFIDENTIAL



BLOCK DIAGRAM 



S/370 HOST COMPUTER
    QUEUES MULTIPLE TASKS
    UP TO SEVEN SUBCHANNELS SHARE DEVICE
    FIXED POINT AND FLOATING POINT DATA, ARRAY PROCESSOR PROGRAMS

S/370 BLOCK MULTIPLEXOR CHANNEL INTERFACE

BULK STORE
    USER PARTITIONED STORAGE
    256K - 1024K BYTES

CONTROL PROCESSOR
    SEQUENCE CONTROL
    STORAGE TRANSFER CONTROL
    PAGING OF DATA BETWEEN STORAGE AND ARITHMETIC ELEMENT

ARITHMETIC ELEMENT
    HIGH SPEED VECTOR AND MATRIX PROCESSOR
    FLOATING POINT
    ALGORITHM CONTROLLED

MICROPROGRAM STORE

FIGURE 1. ARRAY PROCESSOR
SOFTWARE



AE - Microcode 

CP - SPL (370 BAL-Like)

Controls Data Transfer 
AE Scheduling 
Multiprogramming 

Host - VPAM 

VPAM is 2938, APAM Follow-On 
User Program 

Overhead Estimates 

1.8 ms - Initialization
1 ms - Branching 

300 ms - Initialization per Algorithm 
Binding may Eliminate 
Overlapped with AE and I/O
APPLICATION STUDIES



Nuclear Reactor Diffusion Equation 
Golub-Varga-Tridiagonal System 
50-75% of Running Time 
Special Microcode 
AP is 2-3 X 168 
Bulk Store Limitation 

Atmospheric Radiation 

Matrix Multiply - AP is 5-10 X 168
Matrix Inversion (LU) - AP is 7 X 168 
80% or more in AP 

Plasma Computation 

Vlasov-Poisson Equations 

ASD Method (FFT) 

56% of Computation is FFT 
24% Vector OPS 

80% can be done in AP 

AP 5 X 168 
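
The Plasma Computation figures invite a quick Amdahl's-law check. The slide does not say whether "AP 5 X 168" refers to the offloaded kernels or to the whole job; the sketch below assumes the former, so its result is one possible reading rather than a figure from the study. The function name is hypothetical.

```python
def overall_speedup(fraction_in_ap: float, ap_speedup: float) -> float:
    """Amdahl's law: whole-job speedup when only a fraction is accelerated.

    fraction_in_ap: share of the original run time moved to the array processor
    ap_speedup:     how much faster that share runs on the array processor
    """
    return 1.0 / ((1.0 - fraction_in_ap) + fraction_in_ap / ap_speedup)


# Assumption: 80% of the work moves to the AP and runs 5 times faster there.
s = overall_speedup(0.80, 5.0)
print(f"overall speedup: {s:.2f}")

# Upper bound if the AP were infinitely fast: the remaining 20% still limits it.
limit = overall_speedup(0.80, float("inf"))
print(f"upper bound:     {limit:.2f}")
```

Under this reading, the 20% of the computation left on the host caps the whole-job gain at a factor of five no matter how fast the AP is.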
NOAA Weather Model

Already Vectorized 

80% or more in AP

AP 1-2 X 168 using APAM 



European Weather Center 

J. Hague - UK 
Microcode Approach 
50 Algorithms 
100 Man Month Estimate 
20 MIPS (6 X 168) 



Earth Resource 

Digital Filtering 
FFT 

ERTS Data 



Seismic 

Vector OPS 
Vector OPS 
FFT 

Overall 



10-20 X 2938 
2-10 X 168 

24 X 168 
4 X 168
Timing Comparisons

Case       2938     Gusher    2938/Gusher

A          8.38       3.17        2.65
B         12.53       4.62        2.71
C         41.44       6.34        6.54
D1       615.57      27.73       22.2
D2       825.84      32.54       25.4
E         59.07       9.51        6.21
F1        25.06       9.51        2.64
F2        82.85      10.96        7.56
F3        48.96       6.34        7.72
F4        97.89      10.96        8.93
G         82.88       6.34       13.07

Case A    Autocorrelation, 640 point window, 64 output points
     B    Deconvolution (short filter), 1500 pt trace, 32 pt filter
     C    Band pass filter (long filter), 1500 pt trace, 125 pt filter
     D1   Vibroseis (Step 1), Cross Correlation, 4000 pt window, 201 output pts
     D2   Vibroseis (Step 2), Cross Correlation, 7000 pt window, 3000 output pts
     E    Time Variant Filter, three 500 pt windows, 150 pt overlap, 125 pt filter
     F1   Filtering, 3000 pt trace, 32 pt filter
     F2   Filtering, 3000 pt trace, 125 pt filter
     F3   Filtering, 1500 pt trace, 150 pt filter
     F4   Filtering, 3000 pt trace, 150 pt filter
     G    Filtering, 1500 pt trace, 250 pt filter
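
The third numeric column of the timing comparison appears to be the ratio of the 2938 time to the Gusher time. A minimal sketch that recomputes it from the first two columns follows; the dictionary layout and names are illustrative, and the table gives no time units, so none are assumed here.

```python
# (2938 time, Gusher time, quoted ratio) per case, copied from the slide.
timings = {
    "A":  (8.38,   3.17,  2.65),
    "B":  (12.53,  4.62,  2.71),
    "C":  (41.44,  6.34,  6.54),
    "D1": (615.57, 27.73, 22.2),
    "D2": (825.84, 32.54, 25.4),
    "E":  (59.07,  9.51,  6.21),
    "F1": (25.06,  9.51,  2.64),
    "F2": (82.85,  10.96, 7.56),
    "F3": (48.96,  6.34,  7.72),
    "F4": (97.89,  10.96, 8.93),
    "G":  (82.88,  6.34,  13.07),
}

# Recompute the ratio and record how far it is from the quoted value.
recomputed = {}
for case, (t_2938, t_gusher, quoted) in timings.items():
    ratio = t_2938 / t_gusher
    recomputed[case] = ratio
    print(f"{case:>2}: recomputed {ratio:6.2f}   quoted {quoted:6.2f}")

worst = max(abs(recomputed[c] - timings[c][2]) for c in timings)
print(f"largest difference from the quoted ratios: {worst:.3f}")
```

The recomputed ratios agree with the quoted ones to within rounding, which supports reading the column as a speed ratio.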
ALGORITHM EXECUTION TIME

(µs unless noted)

ALGORITHM                        2938         3838
                                 370/168      min/max range

Vector Element Multiplication    3.75N        .3N - .6N
Vector Element Sum               3.75N        .3N - .6N
Scalar Multiply                  2.475N       .3N - .5N
Signed Square Array              2.475N       .3N - .5N
Sum of Squares                   2.4N         .1N - .2N
Sum of Vector Elements           2.4N         .1N - .2N
Vector Inner Product             2.55N        .2N - .4N
Convolving Multiplication        .2N          .1N - .2N
Complex Multiply                 3.75N        .6N - 1.2N
Difference Equation              4.6N         1.1N - 1.2N
Interpolate                      12N          3.4N - 3.5N
Partial Matrix Multiplication    3.75N        - .UN
FFT (1024 Points, Complex)       26.6ms       2.66ms
FFT (1024 Points, Real)          NDA          1.43ms
Vector Move Convert              2.475N       .2N - .4N
Vector Floating to Fixed         2.7N         .2N - .4N
Divide                           NA           .85N - 1.1N
Square Root                      NA           3.0N - 3.2N

Note: NA  - not available on 2938
      NDA - not directly available
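
The per-element coefficients in the execution time table can be evaluated for a concrete vector length. The sketch below reads the Vector Element Multiplication row as 3.75N microseconds on the 2938 and .3N to .6N microseconds on the 3838, and the complex FFT row as 26.6 ms against 2.66 ms; N = 1024 is an arbitrary choice and the helper name is hypothetical.

```python
def time_us(coeff_us_per_element: float, n: int) -> float:
    """Execution time in microseconds for a linear cost model coeff * N."""
    return coeff_us_per_element * n


n = 1024  # illustrative vector length, not taken from the slides

# Vector Element Multiplication
t_2938 = time_us(3.75, n)
t_3838_min = time_us(0.3, n)
t_3838_max = time_us(0.6, n)

# Speedup range of the 3838 over the 2938 (worst case first).
speedup_low = t_2938 / t_3838_max
speedup_high = t_2938 / t_3838_min
print(f"2938: {t_2938:.0f} us, 3838: {t_3838_min:.1f} - {t_3838_max:.1f} us")
print(f"speedup: {speedup_low:.2f} - {speedup_high:.2f}")

# 1024-point complex FFT, times given directly in milliseconds.
fft_speedup = 26.6 / 2.66
print(f"FFT speedup: {fft_speedup:.1f}")
```

Because both machines are modeled as linear in N, the speedup range does not depend on the vector length chosen; only the absolute times do.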
CURRENT ACTIVITY



ELSER TASK FORCE MEETING - 1/27/76 

PHASE II REVIEW - 3/76 

IDENTIFY SCIENTIFIC APPLICATIONS 
IDENTIFY CUSTOMERS 
SPECIFY SCIENTIFIC ALGORITHMS 

DESCRIBE SUPPORT 

* SUPPORT GROUP 

RPQ PROGRAMMING SERVICES 
CUSTOMER EDUCATION 
SOFTWARE PRODUCTS 
DEMONSTRATIONS 

INVESTIGATE 

* SPARSE MATRICES
LINEAR PROGRAMMING 

* PARABOLIC PDE 

* PIC 
PDQ7V2



PDQ7V2 FDP ANN 6/75 

VERSION 1 MODIFICATION 1 - 1/76 
OS/HPAM DISK ERROR RECOVERY 

* 30% PERFORMANCE IMPROVEMENT (INPUT) 
CORRECTION OF MINOR PROBLEMS 

* TIMING COMPARISON WITH PDQ7/17 - PDOM 

TIMING COMPARISON 

* 24 TYPICAL PROBLEMS 

* 10-25% FASTER CPU TIME - 2D 

* 2-5 TIMES FASTER CPU TIME - 3D 

FIRST CUSTOMER EXPERIENCE 

* 30% PERFORMANCE IMPROVEMENT OVER PDQ7/17 
RUNNING 7 HOUR 3D ON 165 

* USING NEW FEATURES 

* ACCURACY BETTER THAN 1/4% 
PROJECTING 370/168
IBM vs COMPETITION



STATUS IN U.S. 

NATIONAL LABS 
HIGH ENERGY PHYSICS 
PLASMA PHYSICS
WEAPON DEVELOPMENT 
WEATHER BUREAU 
REACTOR MANUFACTURERS 
MANUFACTURING 



370/168, 195, CDC/7600 

CDC/7600 
370/195 

CDC/7600 + GE/635 
1108, 360, 370, CDC 



PERFORMANCE

RELATIVE RUN TIME

MACHINE    CPU        ELAPSED TIME    PDQ7V2

7600       1          1               1
195        1          0.8-1.0         1
168MP      1.1-1.2    1.3-1.5         1-2.2
168UP      2          1.8-2.2         1-2.2
158MP      5-6        5-7
158UP      10         9-12            10
145        25         20-30           30



SERVICE BUREAUS

CYBERNET    CDC
INFONET     CSC (1108)
OTHERS      IBM



HARDWARE INSTALLED IN U.S.

GOVERNMENT

ARGONNE NATIONAL LAB             50, 75, 195
OAK RIDGE NATIONAL LAB           75, 91
BROOKHAVEN NATIONAL LAB          7600
SAVANNAH RIVER NATIONAL LAB      195
LOS ALAMOS NATIONAL LAB          4-7600
HANFORD NATIONAL LAB             CYBER 73
NATIONAL REACTOR TEST STATION    75
BETTIS                           7600
KAPL                             7600
LIVERMORE                        4-7600, STAR
SANDIA                           7600
SLAC                             91, 2-168
PRINCETON (PLASMA)               91
UCLA (PLASMA)                    195
WEATHER BUREAU                   2-195

REACTOR MANUFACTURERS

GENERAL ELECTRIC                 2-GE635
WESTINGHOUSE                     2-7600, (IBM)
BABCOCK & WILCOX                 7600
COMBUSTION ENGINEERING           7600, (158-168)

ELECTRIC UTILITIES               95% - IBM



RELATIVE HARDWARE PERFORMANCE
SCIENTIFIC COMPUTING

                 CPU SPEED/168 UP
MACHINE          SCALAR MODE    VECTOR MODE

CRAY 1           10             14-?
CDC 7600         2              2-5
IBM 195          2              2-3
CYBER 175        1.5-2.1        2-4
Amdahl 470       1-2
IBM 168 AP       (1.6-1.8)
IBM 168 MP       (1.5-1.7)
IBM 168 UP       1.0
IBM 158 MP       (2/5-1/2)
IBM 158 UP       1/5-1/3
IBM 145          1/15-1/20



APPENDIX D BENCHMARK JOBS - RELATIVE PERFORMANCE 

System 

1. IBM 360/75 using FORTRAN H with optimization 

2. IBM 360/75 using FORTRAN G - no optimization 

3. IBM 370/158 using FORTRAN H with optimization 

4. IBM 370/168-1 using FORTRAN H with optimization, no high speed multiply
feature, small cache

5. IBM 370/168-III using FORTRAN H with optimization, with high speed multiply
feature, large cache

5a. IBM 370/168-1 using FORTRAN H with optimization, with high speed multiply
feature and large cache

6. CDC CYBER 173

7. CDC CYBER 175 

8. AMDAHL 470V6 - using IBM FORTRAN H with optimization 

9. BURROUGHS B7700 

10. DEC KL10 using F10 with optimization 

11. UNIVAC 1100/40 

RELATIVE PERFORMANCE (System 1 = 1.0) TOTAL CPU TIME

System      1      2      3      4      5     5a      6      7      8      9     10     11

Job 1    1.00    .84    .86   2.33   4.26   4.15   1.72   8.18   4.81    .54     NR   1.46
Job 2    1.00    .53    .86   2.64   4.59   4.36    .97  11.41   4.60    .82    .53   1.34
Job 3    1.00   1.00   1.11   3.14   3.36   3.36    .41   1.44   6.60     NR     NR     NR
Job 4    1.00    .47    .81   3.25   3.56   3.50    .99   8.50   4.83    .84   1.11   1.99
Job 5    1.00    .80   1.08   3.26   3.77   3.78     NR     NR   5.81     NR     NR     NR

NR - not run

NOTE: The five jobs were run as an informal benchmark. Results are indicative, but
not definitive, since running conditions (e.g., standalone vs. multiprogrammed)
were not controlled.
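
One way to condense the relative performance figures into a single number per system is a geometric mean over the jobs that were actually run. This summary is not part of the benchmark report; it is a sketch that assumes larger values mean faster machines (consistent with the 370/168 entries exceeding the 360/75 baseline) and skips NR entries.

```python
import math

SYSTEMS = ["1", "2", "3", "4", "5", "5a", "6", "7", "8", "9", "10", "11"]
NR = None  # "not run"

# Relative performance per job, one value per system, copied from the table.
RESULTS = {
    "Job 1": [1.00, .84, .86, 2.33, 4.26, 4.15, 1.72, 8.18, 4.81, .54, NR, 1.46],
    "Job 2": [1.00, .53, .86, 2.64, 4.59, 4.36, .97, 11.41, 4.60, .82, .53, 1.34],
    "Job 3": [1.00, 1.00, 1.11, 3.14, 3.36, 3.36, .41, 1.44, 6.60, NR, NR, NR],
    "Job 4": [1.00, .47, .81, 3.25, 3.56, 3.50, .99, 8.50, 4.83, .84, 1.11, 1.99],
    "Job 5": [1.00, .80, 1.08, 3.26, 3.77, 3.78, NR, NR, 5.81, NR, NR, NR],
}


def geometric_mean(values):
    """Geometric mean of a non-empty list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Map each system to (geometric mean, number of jobs it ran).
summary = {}
for idx, system in enumerate(SYSTEMS):
    ran = [row[idx] for row in RESULTS.values() if row[idx] is not NR]
    summary[system] = (geometric_mean(ran), len(ran))

for system, (gmean, count) in summary.items():
    print(f"System {system:>2}: {gmean:5.2f} over {count} job(s)")
```

Systems with NR entries are averaged over fewer jobs, so their summaries are not directly comparable with those of systems that ran all five.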



Job 1: Author: Dr. Arnett - Astronomy

This is a large compute bound problem, written in FORTRAN. All 
calculations are done in double precision except on CDC equipment where single 
60-bit precision is adequate. (It is estimated that performance would degrade 
about 10% if CDC used double precision.) 

Job 2: Author: EDUCOM Benchmark 

This is a small FORTRAN program doing double precision matrix multiply.
It tests multiply, add and loop control. 

Job 3: Author: Dr. Michalski - Computer Science 

This is a large and complex PL/I program using bit manipulation. It is 
both a test of compiler integrity and computer power. 

Job 4: Author: Dr. Wagstaff - Mathematics 

This is an intensive test of integer arithmetic on a number theory problem 
in FORTRAN. 

Job 5: Author: Dr. Brown - Mathematics 

This is an extended precision arithmetic program testing both integer
arithmetic and character manipulation. Code is in both FORTRAN and Assembler.



